So welcome everybody to our lecture Deep Learning, today almost on time. We want to continue our journey through the realms of deep learning, and today we want to talk about common practices. And they're not just common practices, they're actually fairly important.
Everybody should know them. This is something people sometimes don't talk about too much, but I think it's really important that everybody taking this class knows about these practices. In particular, it will be about training strategies, optimization and learning rate, then architecture selection, hyperparameter optimization, sampling, class imbalance, and evaluation.
So everybody should know about these. There is not a lot of theory that we are discussing today, but it is really important for your evaluations: if you do any of these wrong, you may very easily end up with results that look really fantastic but are rather optimistic, let's say it that way.
Okay, so far we learned the nuts and bolts of how to train a network. We talked about fully connected and convolutional layers, activation functions, and different loss functions, that is, how we tell the network what's right or wrong. Then we had a little bit of optimization, where we looked into different ways of doing gradient descent, and acceleration techniques for getting to your minimum quite a bit faster.
And then we looked into regularization just in the last lecture. So today we want to talk about common practices: how to choose the architecture, and how to train and evaluate a deep network.
So this is really for when you get going: once you have this set of slides, you can easily go ahead and train your own networks. And first things first, the test data. Where do you put the test data? Do you start designing everything right away? No, you don't start designing anything. The first thing you do is collect your data, and once you have your data, you take your test data and you put it in here.
You put it into this vault, into the safe. The test data is something we only look at at the very end, after we have finished everything that we are talking about today. We never look into the test data; don't look into it for any of the steps that we're doing here. You only look at it for the very final evaluation.
So ideally the test set should be kept in a vault and brought out only at the end of the data analysis, and this is absolutely true.
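As a small aside, a typical way to set this up in code is to split the data once at the very beginning and store the test indices somewhere you will not touch again. The following is just a minimal NumPy sketch; the dataset size and the 70/15/15 split ratio are made-up values for illustration.

```python
import numpy as np

# Minimal sketch: split a dataset once into train/validation/test,
# then never touch the test indices until the final evaluation.
rng = np.random.default_rng(seed=42)

n_samples = 1000                      # hypothetical dataset size
indices = rng.permutation(n_samples)  # shuffle once, up front

n_test = int(0.15 * n_samples)
n_val = int(0.15 * n_samples)

test_idx = indices[:n_test]                   # goes into the "vault"
val_idx = indices[n_test:n_test + n_val]      # used for model/hyperparameter selection
train_idx = indices[n_test + n_val:]          # used for fitting the weights

np.save("test_indices.npy", test_idx)         # store it and do not look at it again
```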
So note that overfitting is extremely easy in neural networks; there's also a very nice publication showing that you can even fit ImageNet if you assign random labels.
Cross-validation: if you do repeated cross-validation, that is, you repeatedly look into your test data set, you may even overfit to the test data set although you're only looking at the evaluation. Just by random initialization you could have a certain random seed that happens to be beneficial for a specific test data set.
So only look at the test data set at the very end. If you don't, there can be a substantial difference between what you're actually estimating and the performance on a truly unseen test data set. The test data set should really be out-of-bag, so to speak, data that you have never seen before.
And if you don't do that properly, you get heavily optimistic results. The worst thing that you can do is to include the test data in your training set, because if you do so, you will get super optimistic results on your test data set, even with very simple classifiers.
Let's just take the nearest neighbor classifier: if the test data set is part of your training set, you always get a 100% recognition rate. So if you get something very close to a 100% recognition rate, really make sure that you didn't do anything wrong among the things we are describing in this set of slides.
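To make this concrete, here is a small toy illustration (not from the lecture) of why leaking test data into the training set gives meaningless numbers, using scikit-learn's 1-nearest-neighbor classifier on completely random labels.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy illustration of data leakage: if the "test" samples are also part of
# the training set, a 1-nearest-neighbor classifier trivially scores 100%,
# because every query finds itself at distance zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # hypothetical features
y = rng.integers(0, 2, size=100)        # random binary labels

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)                           # training set accidentally contains the test data
print(knn.score(X, y))                  # -> 1.0, a meaningless "perfect" result
```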
Yeah, and the same holds for choosing the architecture: you never choose your architecture based on observations from your test data; you choose the architecture based on observations on your validation data set.
Okay, and for initial experiments we strongly recommend working with smaller data sets, so you only use a subset of the data. Obviously, if you are debugging and there's still some error in your code that you're trying to fix, you never take the full data set, because then you have to wait for many hours until the bug appears. You want to use a very small data set just to make sure that everything you implemented actually works.
Okay, some training strategies. For the very first steps, let's say you have your own loss functions and your own layer implementations, the first thing you do is check the correct computation of the gradient by comparing the analytic and the numerical gradient.
Because these are the building blocks of your networks, and if even one of your building blocks isn't working correctly, then obviously the entire system will be affected. This may be really hard to spot, because we are using gradient descent here, and you've already seen that subgradients, or gradients that only approximately point into the right direction, can already get you somewhere.
So be really sure that you use a correct analytic gradient in your implementation, and check it against a numerical one.
Use centered differences for the numerical gradient, and use the relative error instead of absolute differences. For the numerics, make sure that you use double precision for checking. With lower precision you obviously need less memory and so on,
but for checking you should use double precision, and then you should be able to observe very small values. You can then also choose your h appropriately; that's the step size for the numerical gradient check.
Then some additional recommendations: only use a few data points, because then you have fewer issues with the non-differentiable parts of your loss function, and train the network for a short period of time before performing the gradient checks.
Check the gradient first without and then with the regularization terms, so that you check that both of them work together, and turn off data augmentation and dropout when you do your gradient checks.
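Putting these recommendations together, a gradient check could look roughly like the following sketch: centered differences in double precision, compared to the analytic gradient via the relative error. The quadratic toy loss is just a stand-in for your own layer or loss function.

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Centered-difference gradient of a scalar loss f at parameters w.
    Use float64 (double precision) for the check."""
    w = w.astype(np.float64)
    grad = np.zeros_like(w)
    it = np.nditer(w, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        orig = w[idx]
        w[idx] = orig + h
        f_plus = f(w)                         # f(w + h e_i)
        w[idx] = orig - h
        f_minus = f(w)                        # f(w - h e_i)
        w[idx] = orig                         # restore the original value
        grad[idx] = (f_plus - f_minus) / (2.0 * h)
        it.iternext()
    return grad

def relative_error(g_analytic, g_numeric, eps=1e-12):
    """Relative error is more informative than the absolute difference."""
    num = np.abs(g_analytic - g_numeric)
    den = np.maximum(np.abs(g_analytic) + np.abs(g_numeric), eps)
    return np.max(num / den)

# Toy example with a quadratic loss whose analytic gradient is simply w:
f = lambda w: 0.5 * np.sum(w ** 2)
w = np.random.randn(3, 4)
print(relative_error(w, numerical_gradient(f, w)))   # should be ~1e-8 or smaller
```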
Okay, good. The next thing is to check the initialization and the loss. First, check the random initialization of the layers: you compute the loss for each class on the untrained network with regularization turned off.
Then you compare this to the loss achieved when deciding for a class randomly. Your initialization should not yet produce very meaningful results, so you compare it to random guessing and check what comes out of that.
And then you repeat that with multiple random initializations. You want to make sure that your initialization actually produces random class distributions and does not produce only one class all of the time.
If it always produces the same class, you should check your initialization.
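As a concrete sanity check, for a softmax classifier with cross-entropy loss and C classes, the initial loss of an untrained network should be close to -log(1/C). The following PyTorch sketch uses a made-up small network and random data; repeat it with several seeds and also look at the predicted class distribution.

```python
import math
import torch
import torch.nn as nn

# Sanity check on a freshly initialized classifier (regularization off):
# with C classes, the cross-entropy loss of an untrained softmax classifier
# should be close to -log(1/C). Network and data are hypothetical placeholders.
num_classes = 10
model = nn.Sequential(nn.Linear(32 * 32 * 3, 100), nn.ReLU(), nn.Linear(100, num_classes))
criterion = nn.CrossEntropyLoss()

x = torch.randn(64, 32 * 32 * 3)                       # dummy inputs
y = torch.randint(0, num_classes, (64,))               # random labels

with torch.no_grad():
    loss = criterion(model(x), y)

print(f"initial loss: {loss.item():.3f}")
print(f"expected for random guessing: {math.log(num_classes):.3f}")   # ln(10) = 2.303
```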
Okay, then the next thing is training: you check whether the architecture is in general capable of learning the task.
Before training the network on the full data set, you take a small subset, let's say only five to twenty samples, and then you overfit the network to get a zero loss.
So just make sure that with such a small data set you are able to overfit, essentially memorizing the entire training subset; your network should be able to solve that task with zero loss.
And if it doesn't, you should think about the architecture that you're using. Optionally, you can of course also turn the regularization off, because your regularization may hinder the overfitting, but make sure that you can actually overfit.
Because if you can't overfit, your model probably doesn't have enough degrees of freedom to actually learn this.
Or, if the network can't overfit, you may also have a bug in the implementation.
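A minimal version of this overfitting test could look like the sketch below; the tiny network and the ten random samples are placeholders for your own architecture and a slice of your training data, and regularization (here weight decay) is switched off.

```python
import torch
import torch.nn as nn

# Sketch: overfit a tiny subset (here 10 random samples) to near-zero loss.
# Weight decay and other regularization are turned off; model and data are
# placeholders for your own architecture and a slice of your training set.
torch.manual_seed(0)
x_small = torch.randn(10, 20)                # 10 samples, 20 features
y_small = torch.randint(0, 3, (10,))         # 3 classes

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=0.0)
criterion = nn.CrossEntropyLoss()

for step in range(500):
    optimizer.zero_grad()
    loss = criterion(model(x_small), y_small)
    loss.backward()
    optimizer.step()

print(loss.item())   # should be very close to zero; if not, suspect the model or a bug
```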
That's always something you should consider, also if you download code from somebody else.
They probably did a very good job, but you should keep in mind that there may also be a bug in software that you didn't write yourself.
And that's typically really hard to find, but keep it in mind; it may happen.
Hopefully it doesn't occur very frequently, but it may happen, and if you find and fix it, then please also share the fix with the rest of the world, because they will be tremendously happy.
So as we already said, your model may be too small, so you can increase the number of parameters, or your model may simply not be suitable for the task.
If you have something like that, let's say the correct solution is simply not in the solution space that your model offers, then you won't be able to solve the task either.
Then, obviously, also get an idea about the data, the loss, and the network: it's always good to have some understanding of how the data behaves, how your loss behaves, and how the network behaves.
So how do we do that? Well, you can have a look at the loss curves over the iterations, and here again it's really important that we point out these three curves here.
Because if you choose your step size too high, you get this exploding gradient, and if you have a step size that is too small, you get this vanishing gradient.
The network doesn't train properly, and you have to find an appropriate learning rate.
So please check that, and then also see if there are large jumps in the curve.
You want to avoid large jumps in the curve; it should look like this, so it should be smooth.
If you have something that is really oscillating all the time, where the loss curve is jumping up and down, you may want to check what you're doing.
It is not good when your loss curve jumps a lot; it might be that you have a very unstable gradient computation or something like that which causes your loss to jump this much.
Okay, another thing you can do if you use mini-batches: you can increase the batch size, which can also help you get smoother loss curves.
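To illustrate, here is a small sketch that smooths a (synthetic) noisy loss curve with an exponential moving average; larger mini-batches reduce the noise in the raw curve in a similar way.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch: inspect the training loss over the iterations. A running average
# (here an exponential moving average) makes the trend easier to judge.
# The synthetic noisy curve below just stands in for your own training log.
rng = np.random.default_rng(0)
iterations = np.arange(2000)
raw_loss = np.exp(-iterations / 800) + 0.1 * rng.standard_normal(2000)

ema, smoothed = None, []
for value in raw_loss:
    ema = value if ema is None else 0.98 * ema + 0.02 * value
    smoothed.append(ema)

plt.plot(iterations, raw_loss, alpha=0.3, label="per-iteration loss (noisy)")
plt.plot(iterations, smoothed, label="smoothed loss (EMA)")
plt.xlabel("iteration")
plt.ylabel("training loss")
plt.legend()
plt.show()
```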
Another thing that you want to have a look at is obviously the validation loss, and you can use the validation loss to identify a good set of parameters.
Because you are minimizing the training loss, right? So the training loss curve should go down until you essentially reach a local minimum.
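A common way to exploit this is to evaluate on the validation split after every epoch and keep the parameters with the lowest validation loss; the sketch below uses synthetic data and a toy model just to show the bookkeeping.

```python
import copy
import torch
import torch.nn as nn

# Sketch: use the validation loss to pick a good set of parameters.
# After every epoch we evaluate on a held-out validation split and keep
# the weights with the lowest validation loss. Data, model, and the number
# of epochs are hypothetical placeholders.
torch.manual_seed(0)
x_train, y_train = torch.randn(200, 20), torch.randint(0, 3, (200,))
x_val, y_val = torch.randn(50, 20), torch.randint(0, 3, (50,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

best_val_loss, best_state = float("inf"), None
for epoch in range(50):
    optimizer.zero_grad()
    loss = criterion(model(x_train), y_train)    # one full-batch training step per epoch
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val).item()
    if val_loss < best_val_loss:                 # remember the best parameters so far
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)                # this model goes into the final test evaluation
```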